# General packages for wrangling & visualization
library(tidyverse)
library(here)
library(janitor)
library(plotly)
# Packages for spatial data & point pattern analysis
library(tmap)
library(sf)
library(spatstat)
library(maptools)
library(sp)
library(raster)
# Packages for cluster analysis:
library(NbClust)
library(cluster)
library(factoextra)
library(dendextend)
library(ggdendro)
voles <- read_sf(dsn = here("data","redtreevoledata"),
layer = "ds033") %>%
dplyr::select(COUNTY) %>%
filter(COUNTY == "HUM") %>%
st_transform(crs = 4326)
st_crs(voles) #check projection
## Coordinate Reference System:
## EPSG: 4326
## proj4string: "+proj=longlat +datum=WGS84 +no_defs"
# Get Humboldt County outline
humboldt <- read_sf(dsn = here("data","redtreevoledata"),
layer = "california_county_shape_file") %>%
filter(NAME == "Humboldt") %>%
dplyr::select(NAME)
st_crs(humboldt) <- 4326
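Note that st_crs()<- only assigns CRS metadata (the coordinates themselves are unchanged), whereas st_transform() actually reprojects; assignment is used here on the assumption that the shapefile already stores longitude/latitude but is missing its projection info. A quick sanity check that the two layers now match:
# Confirm both layers share the same CRS before plotting them together
st_crs(humboldt) == st_crs(voles)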
Initial visualization of the data
plot(voles)
plot(humboldt)
# Plot them together with tmap:
tm_shape(humboldt) +
tm_fill() +
tm_shape(voles) +
tm_dots(size = 0.2)
# Or with ggplot2:
ggplot() +
geom_sf(data = humboldt) +
geom_sf(data = voles)
For more depth on spatial data in R, see Geocomputation in R by Robin Lovelace (free and online).
We want to explore the point pattern in a few different ways (quadrat analysis, nearest neighbor analysis, etc.) so we can compare approaches. First we need to convert the data to ‘ppp’ (point pattern) and ‘owin’ (observation window) objects, as used by maptools and spatstat (because sf is still catching up on raster and point pattern analysis)…
## UPDATE!
voles_sp <- as(voles,"Spatial") #ensure R recognizes the voles data set as spatial data. Some functions dont like sf data
# voles_ppp <- as(voles_sp, "ppp") #projection error with package, allison will update later
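The projection error most likely comes from maptools refusing to coerce unprojected (long/lat) coordinates to spatstat classes. A minimal sketch of a workaround, assuming UTM Zone 10N (EPSG:32610) is a reasonable projected CRS for Humboldt County:
# Sketch: transform to a projected CRS, then coerce to spatstat classes
voles_utm <- st_transform(voles, crs = 32610) # UTM Zone 10N (assumed CRS)
humboldt_utm <- st_transform(humboldt, crs = 32610)
voles_ppp <- as(as(voles_utm, "Spatial"), "ppp") # point pattern
humboldt_win <- as(as(humboldt_utm, "Spatial"), "owin") # observation window
voles_full <- ppp(voles_ppp$x, voles_ppp$y, window = humboldt_win) # events + window
plot(voles_full)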
k-means cluster analysis in R
# Clean the data
iris_nice <- iris %>%
clean_names()
# Explore data
ggplot(iris_nice) +
geom_point(aes(x = petal_length, y = petal_width, color = species))
# We think it looks like there are 3 clusters
# Ask R: how many clusters do you think there should be for this dataset?
number_est <- NbClust(iris_nice[1:4], # the brackets tell R to only consider columns 1-4
min.nc = 2,
max.nc = 10,
method = "kmeans") # see ?NbClust for the different methods available
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 11 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
The graph output is a count of how many indices proposed each number of clusters. Here the majority propose 2 clusters, but R is NO substitute for assessing the data visually yourself; since there are 3 species, we think 3 clusters makes sense.
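Rather than reading the printed summary, you can also pull the votes and the winning partition straight out of the returned object; a quick sketch using the number_est object from above:
number_est$Best.nc # each index's proposed cluster count (and index value)
number_est$Best.partition # cluster assignments under the majority-rule choice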
# Run kmeans
iris_km <- kmeans(iris_nice[1:4], 3) # kmeans specifying 3 groups!
iris_km$size
## [1] 33 21 96
iris_km$centers
## sepal_length sepal_width petal_length petal_width
## 1 5.175758 3.624242 1.472727 0.2727273
## 2 4.738095 2.904762 1.790476 0.3523810
## 3 6.314583 2.895833 4.973958 1.7031250
iris_km$cluster # use to determine what cluster R thinks each observation belongs to
## [1] 1 2 2 2 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1
## [38] 1 2 1 1 2 2 1 1 2 1 2 1 1 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
The size outputs have to add up to 150, the total number of observations in the dataset.
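One caveat: kmeans() starts from random centroids, so cluster numbering (and occasionally membership) can change between runs. A minimal sketch for reproducibility; the seed value and nstart = 25 are arbitrary choices, and the object is renamed so it doesn't clash with the results above:
set.seed(10) # any fixed seed makes the run reproducible
iris_km_stable <- kmeans(iris_nice[1:4], centers = 3, nstart = 25) # keep the best of 25 random starts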
# Bind the cluster number to the original data
iris_cl <- data.frame(iris_nice, cluster_no = factor(iris_km$cluster)) #can also use as.factor
# Plot
ggplot(iris_cl) +
geom_point(aes(x = sepal_length, y = sepal_width, color = cluster_no))
A little better…
ggplot(iris_cl) +
geom_point(aes(x = petal_length,
y = petal_width,
color = cluster_no,
pch = species)) +
scale_color_brewer(palette = "Set2")
# Or, a 3D plot with plotly
plot_ly(data = iris_cl,
x = ~petal_length,
y = ~petal_width,
z = ~sepal_width,
type = "scatter3d",
color = ~cluster_no,
symbol = ~species,
marker = list(size = 3),
colors = "Set1")
Hierarchical cluster analysis (dendrograms) in R
Relevant functions:
- stats::hclust() - agglomerative hierarchical clustering
- cluster::diana() - divisive hierarchical clustering

We'll be using WorldBank environmental data (simplified), wb_env.csv
# Get the data
wb_env <- read_csv(here("data", "wb_env.csv"))
# Only keep top 20 greenhouse gas emitters (for simplifying visualization here...)
wb_ghg_20 <- wb_env %>%
arrange(-ghg) %>%
head(20)
# Scale it so all variables are on the same scale (can consider this for k-means clustering, too...)
wb_scaled <- as.data.frame(scale(wb_ghg_20[3:7]))
rownames(wb_scaled) <- wb_ghg_20$name # Add back in the rownames (country name) to scaled dataset
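A quick sanity check on the scaling (each column should now have mean ~0 and standard deviation 1):
round(colMeans(wb_scaled), 2) # ~0 for every variable
round(apply(wb_scaled, 2, sd), 2) # 1 for every variable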
diss <- dist(wb_scaled, method = "euclidean", upper=TRUE)
# Use Euclidean distances for complete-linkage agglomerative hierarchical clustering
hc_complete <- hclust(diss, method = "complete" )
# plot outputs
plot(hc_complete, cex = 0.6, hang = -1)
# plot the same thing in ggplot
ggdendrogram(hc_complete, rotate = T) +
theme_classic() +
labs(x = "Country")
Observations that join lower on the dendrogram are more closely related in multivariate space; countries that join higher up are more dissimilar.
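If you want discrete group assignments from the tree, cutree() will slice the dendrogram at a chosen number of clusters; k = 4 below is an arbitrary choice for illustration:
wb_cut <- cutree(hc_complete, k = 4) # named vector: country -> cluster number
wb_cut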